A statistical model for predicting individual house prices and
constructing a house price index is proposed utilizing information re- 
garding sale price, time of sale and location (ZIP code). This model 
is composed of a fixed time effect and a random ZIP (postal) code
effect combined with an autoregressive component. The former two 
components are applied to all home sales, while the latter is ap- 
plied only to homes sold repeatedly. The time effect can be converted 
into a house price index. To evaluate the proposed model and the 
resulting index, single-family home sales for twenty US metropoli- 
tan areas from July 1985 through September 2004 are analyzed. The 
model is shown to have better predictive abilities than the bench- 
mark S&P/Case-Shiller model, which is a repeat sales model, and a 
conventional mixed effects model. Finally, Los Angeles, CA, is used 
to illustrate a historical housing market downturn. 

1. Introduction. Modeling house prices presents a unique set of challenges.
Houses are distinctive; each has its own set of hedonic characteristics:
number of bedrooms, square footage, location, amenities and so forth.
Moreover, the price of a house, or the value of the bundle of characteristics, 
is observed only when sold. Sales, however, occur infrequently. As a result, 
during any period of time, out of the entire population of homes, only a small 
percentage are actually sold. From this information, our objective is to de- 
velop a practical model to predict prices from which we can construct a price 
index. Such an index would summarize the housing market and would be 
used to monitor changes over time. Including both objectives allows one to
look at both micro and macro features of a market, from individual houses 
to entire markets. In the following discussion, we propose an autoregres- 
sive model which is a simple, but effective and interpretable, way to model 
house prices and construct an index. We show that our model outperforms, 
in a predictive sense, the benchmark S&P/Case-Shiller Home Price Index 
method when applied to housing data for twenty US metropolitan areas. We 
use these results to evaluate the proposed autoregressive model as well as 
the resulting house price index. 

A common approach for modeling house prices, called repeat sales, uti- 
lizes homes that sell multiple times to track market trends. Bailey, Muth and 
Nourse (1963) first proposed this method and Case and Shiller (1987, 1989) 
extended it to incorporate heteroscedastic errors. In both models, the log 
price difference between two successive sales of a home is used to construct 
an index using linear regression. The previous sale price acts as a surrogate 
for hedonic information, provided the home does not change substantially 
between sales. There is a large body of work focused on improving the index 
estimates produced by the Bailey, et al. approach. For instance, a modified 
form of the repeat sales model is used for the Home Price Index produced by 
the Office of Federal Housing Enterprise Oversight (OFHEO). Gatzlaff and 
Haurin (1997) suggest a repeat sales model that corrects for the correlation 
between economic conditions and the chance of a sale occurring. Alterna- 
tively, Shiller (1991) and Goetzmann and Peng (2002) propose arithmetic 
average versions of the repeat sales estimator as an alternative to the origi- 
nal geometric average estimator. The former work is used commercially by 
Standard and Poors to produce the S&P/Case-Shiller Home Price Index. 
We use this index in our analysis because it is the best known.

Several criticisms have been made about repeat sales methods. Theoreti- 
cally, for a house to be included in a repeat sales analysis, no changes must 
have been made to it; however, in practice, that is almost never the case. 
Furthermore, Englund, Quigley and Redfearn (1999) and Goetzmann and 
Spiegel (1995) have commented on the difficulty of detecting such changes
without the availability of additional information about the home. Goetzmann
and Spiegel, however, do propose an alternate model which corrects
for the effect of changes to homes around the time the house is sold. 

Even if homes which have changed are removed from the data set, an index 
constructed out of the remaining homes may still not reflect the true index 
value. Case and Quigley (1991) argue that houses age, which has a depreciating
effect on their price. Therefore, Case, Pollakowski and Wachter (1991)
write, repeat sales indices produce estimates of time effects confounded with
age effects. Palmquist (1982) has suggested applying an independently com- 
puted depreciation factor to account for the impact of age. 

In a sample period, out of the entire population of homes, only a small
fraction are actually sold. A fraction of these sales are repeat sales homes 
with no significant changes. Recall that the remaining sales, those of the 
single sales homes, are omitted from the analysis. If repeat sales indices are 
used to describe the housing market as a whole, one would like the sam- 
ple of repeat sales homes to have similar characteristics to all homes. If 
not, Case, Pollakowski and Wachter remark that the indices would be af- 
fected by sample selection bias. Englund, Quigley and Redfearn in a study 
of Swedish home sales, and Meese and Wallace (1997), in a study of Oakland 
and Fremont home sales, both found that repeat sales homes are indeed different
from single sale homes. Both studies also observed that in addition to
being older, repeat sales homes were smaller and more "modest" [Englund, 
Quigley and Redfearn (1999)]. Therefore, repeat sales indices seem to pro- 
vide information only about a very specific type of home and may not apply 
to the entire housing market. However, published indices do not seem to be 
interpreted in that manner. Case and Quigley (1991) propose an alternative 
hybrid model that combines repeat sales methodology with hedonic infor- 
mation which makes use of all sales. While the index constructed with this 
method represents all home sales, it requires housing characteristics which 
may be difficult to collect on a broad scale. 

We feel the repeat sales concept is valuable although the current models 
of this type have the issues described above. The proposed model applies the 
repeat sales idea in a new way to address some of the criticisms while still 
maintaining the simplicity and reduced data requirements that the original 
Bailey et al. method had. While our primary goal is prediction, we believe 
the resulting index could be a better general description of housing sales 
than traditional repeat sales methodology. 

In our method, log prices are modeled as the sum of a time effect (index), 
a location effect modeled as a random effect for ZIP (postal) code, and 
an underlying first-order autoregressive time series [AR(1)]. This structure 
offers four advantages. First, the price index is estimated with all sales: single 
and repeat. Essentially, the index can be thought of as a weighted sum of 
price information from single and repeat sales. The latter component receives 
a much higher weight because more useful information is available for those 
homes. Second, the previous sale price becomes less useful the longer it has 
been since the last sale. The AR(1) series includes this feature into the model 
more directly than the Case-Shiller method. Third, metropolitan areas are 
diverse and neighborhoods may have disparate trends. We include ZIP code 
effects to model these differences in location.[1] Finally, the proposed model
is straightforward to interpret even while including the features described
above. We believe the model captures trends in the overall housing market
better than existing repeat sales methods and is a practical alternative.

[1] ZIP code was readily available in our data; other geographic variables at roughly this
scale might have been even more useful had they been available.

We apply this model to data on single family home sales from July 1985 
through September 2004 for twenty US metropolitan areas. These data are 
described in Section 2. The autoregressive model is outlined and estimation 
using maximum likelihood is described in Section 3; results are discussed 
in Section 4. For comparison, two alternative models are fit: a conventional 
mixed effects model and the method used in the S&P/Case-Shiller Home 
Price Index. As a quantitative way to compare the indices, the predictive 
capacity of the three methods is assessed in Section 5. In Section 6 we
examine the case of Los Angeles, CA, where the proposed model does not 
perform as well. We end with a general discussion in Section 7. 

2. House price data. The data comprise single family home sales
qualifying for conventional mortgages from the twenty US metropolitan ar- 
eas listed in Table 1. These sales occurred between July 1985 and September 
2004. Not included in these data are homes with prices too high to be con- 
sidered for a conventional mortgage or those sold at subprime rates. Note, 
however, that subprime loans were not prevalent during the time period covered
by our data. Similar data are used by Fannie Mae and Freddie Mac, and
to construct the OFHEO Home Price Index.

For each sale, the following information is available: address with ZIP 
code, month and year of sale, and price. To ensure adequate data per time 
period, we divide the sample period into three month intervals for a total 
of 77 periods, or quarters. We make an attempt to remove sales which are 
not arm's length by omitting homes sold more than once in a single quar- 
ter. Given the lack of hedonic information, we have no way of determining 
whether a house has changed substantially between sales. Therefore, we do 
not filter our data to remove such houses. 
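
The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(house_id, year, month, price)` and both function names are assumptions, and the filter reads "omitting homes" as dropping every sale of a flagged house.

```python
from collections import Counter


def quarter_index(year, month, start_year=1985, start_month=7):
    """Map a sale month to a 1-based three-month period.

    July-September 1985 is period 1, so September 2004 falls in period 77.
    """
    months_elapsed = (year - start_year) * 12 + (month - start_month)
    if months_elapsed < 0:
        raise ValueError("sale precedes the start of the sample period")
    return months_elapsed // 3 + 1


def drop_same_quarter_resales(sales):
    """Remove every sale of any house that sold more than once in one quarter.

    `sales` is a list of (house_id, year, month, price) tuples; this layout
    is hypothetical and only serves the illustration.
    """
    counts = Counter((house, quarter_index(y, m)) for house, y, m, _ in sales)
    flagged = {house for (house, _), n in counts.items() if n > 1}
    return [s for s in sales if s[0] not in flagged]


# Made-up records: house "A" sells twice within the first quarter.
sales = [
    ("A", 1985, 7, 100_000),
    ("A", 1985, 9, 140_000),
    ("B", 1990, 1, 90_000),
    ("B", 2004, 9, 150_000),
]
kept = drop_same_quarter_resales(sales)
```

Only the two sales of house "B" survive the filter in this toy input; a real pipeline would also need to match addresses to house identifiers, which is outside the scope of the sketch.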

Table 1
Metropolitan areas in the data

Ann Arbor, MI      Kansas City, MO    Minneapolis, MN     Raleigh, NC
Atlanta, GA        Lexington, KY      Orlando, FL         San Francisco, CA
Chicago, IL        Los Angeles, CA    Philadelphia, PA    Seattle, WA
Columbia, SC       Madison, WI        Phoenix, AZ         Sioux Falls, SD
Columbus, OH       Memphis, TN        Pittsburgh, PA      Stamford, CT

Table 2
Summary counts for a selection of cities

Metropolitan area       Sales     Houses
Stamford, CT           14,602     11,128
Ann Arbor, MI          68,684     48,522
Pittsburgh, PA        104,544     73,871
Los Angeles, CA       543,071    395,061
Chicago, IL           688,468    483,581

Table 2 displays the number of sales and unique houses sold in the sample
period for a selection of cities. Complete tables for all summaries in this
section are provided in Appendix A. Observe that the total number of sales
is always greater than the number of houses because houses can sell multiple
times (repeat sales). Perhaps more illuminating is Table 3, where we count
the number of times each house is sold. We see that as the number of sales 
per house increases, the number of houses reduces rapidly. Nevertheless, a 
significant number of houses sell more than twice. With a sample period of 
nearly twenty years, this is not unusual; however, single sales are the most 
common despite the long sample period. The first column of Table 3 shows 
this clearly. Moreover, this pattern holds for all cities in our data. Finally, 
in Figure 1, we plot the median price across time for the subset of cities. 
This graph illustrates that both the cost of homes and the trends over time 
vary considerably across cities. 

For all metropolitan areas in our data, the time of a sale is fuzzy, as there 
is often a lag between the day when the price is agreed upon and the day 
the sale is recorded (around 20-60 days). Theoretically, the true value of 
the house would have changed between these two points. Therefore, in the 
strictest sense, the sale price of the house does not reflect the price at the 
time when the sale is recorded. Dividing the year into quarters reduces the 
importance of this lag effect. 

Table 3
Sale frequencies for a selection of cities

Metropolitan area      1 sale    2 sales    3 sales    4+ sales
Stamford, CT            8,200      2,502        357          62
Ann Arbor, MI          32,458     12,662      2,781         621
Pittsburgh, PA         48,618     20,768      3,749         718
Los Angeles, CA       272,258    100,918     18,965       2,903
Chicago, IL           319,340    130,234     28,369       5,603

[Figure: "Median Price Over Time", plotted by quarter from Jul 1985 to Jul 2004.]
Fig. 1. Median prices for a selection of cities.

3. Model. The log house price series is modeled as the sum of an index
component, an effect for ZIP code (as an indicator for location), and an
AR(1) time series. The sale prices of a particular house are treated as a
series of sales: $y_{i,1,z}, y_{i,2,z}, \ldots, y_{i,j,z}, \ldots$, where $y_{i,j,z}$ is the log sale price of
the $j$th sale of the $i$th house in ZIP code $z$. Note that $y_{i,1,z}$ is defined as
the first sale price in the sample period; as a result, both new homes and
old homes sold for the first time in the sample period are indicated with the
same notation.

Let there be $1, \ldots, T$ discrete time periods where house sales occur. Allow
$t(i,j,z)$ to denote the time period when the $j$th sale of the $i$th house in
ZIP code $z$ occurs and let $\gamma(i,j,z) = t(i,j,z) - t(i,j-1,z)$, or the gap time
between sales. Finally, there are a total of $N = \sum_{z=1}^{Z} \sum_{i=1}^{I_z} J_i$ observations
in the data where there are $Z$ ZIP codes, $I_z$ houses in each ZIP code and $J_i$
sales for a given house.

The log sale price $y_{i,j,z}$ can now be described as follows:

$$
\begin{aligned}
y_{i,1,z} &= \mu + \beta_{t(i,1,z)} + \tau_z + \varepsilon_{i,1,z}, && j = 1, \\
y_{i,j,z} &= \mu + \beta_{t(i,j,z)} + \tau_z + \phi^{\gamma(i,j,z)} \bigl( y_{i,j-1,z} - \mu - \beta_{t(i,j-1,z)} - \tau_z \bigr) + \varepsilon_{i,j,z}, && j > 1,
\end{aligned}
\tag{1}
$$

where:

1. The parameter $\beta_{t(i,j,z)}$ is the log price index at time $t(i,j,z)$. Let $\beta_1, \ldots, \beta_T$
denote the log price indices, assumed to be fixed effects.

2. $\phi$ is the autoregressive coefficient and $|\phi| < 1$.

3. $\tau_z$ is the random effect for ZIP code $z$. $\tau_z \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\tau^2)$ where $\tau_1, \ldots, \tau_Z$ are
the ZIP code random effects which are distributed normally with mean 0
and variance $\sigma_\tau^2$ and where i.i.d. denotes independent and identically
distributed.

4. We impose the restriction that $\sum_{t=1}^{T} n_t \beta_t = 0$ where $n_t$ is the number of
sales at time $t$. This allows us to interpret $\mu$ as an overall mean.

5. Finally, let

$$
\varepsilon_{i,1,z} \sim \mathcal{N}\!\left(0, \frac{\sigma_\varepsilon^2}{1-\phi^2}\right), \qquad
\varepsilon_{i,j,z} \sim \mathcal{N}\!\left(0, \frac{\sigma_\varepsilon^2 \bigl(1-\phi^{2\gamma(i,j,z)}\bigr)}{1-\phi^2}\right) \quad (j > 1),
$$

and assume that all $\varepsilon_{i,j,z}$ are independent.

Note that there is only one process for the series $y_{i,1,z}, y_{i,2,z}, \ldots$. The error
variance for the first sale, $\sigma_\varepsilon^2/(1-\phi^2)$, is a marginal variance. For subsequent
sales, because we have information about previous sales, it is appropriate
to use the conditional variance (conditional on the previous sale),
$\sigma_\varepsilon^2 (1-\phi^{2\gamma(i,j,z)})/(1-\phi^2)$, instead. For more details refer to the supplemental article
[Nagaraja, Brown and Zhao (2010)].
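
One way to see where the conditional variance comes from is to iterate a one-step AR(1) recursion across the gap. The sketch below assumes a latent series observed every period with independent one-step innovations $e_s$ of variance $\sigma_\varepsilon^2$ (the symbols $e_s$ and $\gamma$ without arguments are introduced here for illustration only):

$$
u_{t+\gamma} = \phi^{\gamma} u_t + \sum_{k=0}^{\gamma-1} \phi^{k} e_{t+\gamma-k},
\qquad
\operatorname{Var}\!\Bigl( \sum_{k=0}^{\gamma-1} \phi^{k} e_{t+\gamma-k} \Bigr)
= \sigma_\varepsilon^2 \sum_{k=0}^{\gamma-1} \phi^{2k}
= \frac{\sigma_\varepsilon^2 \bigl(1-\phi^{2\gamma}\bigr)}{1-\phi^2}.
$$

Under that assumption the accumulated error over a gap of $\gamma$ periods has exactly the variance stated for $j > 1$, it equals $\sigma_\varepsilon^2$ when $\gamma = 1$, and it approaches the marginal variance $\sigma_\varepsilon^2/(1-\phi^2)$ as $\gamma$ grows.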

The underlying series for each house is given by $u_{i,j,z} = y_{i,j,z} - \mu - \beta_{t(i,j,z)} - \tau_z$.
We can rewrite this series as $u_{i,j,z} = \phi^{\gamma(i,j,z)} u_{i,j-1,z} + \varepsilon_{i,j,z}$ where $\varepsilon_{i,j,z}$
is as given above. This autoregressive series is stationary, given a starting
observation $u_{i,1,z}$, because $E[u_{i,j,z}] = 0$, a constant, where $E[\cdot]$ is the
expectation function, and the covariance between two points depends only on the
gap time and not on the actual sale times. Specifically,
$\operatorname{Cov}(u_{i,j,z}, u_{i,j',z}) = \sigma_\varepsilon^2 \phi^{(t(i,j',z) - t(i,j,z))}/(1-\phi^2)$ if $j < j'$. Therefore, the covariance between a
pair of sales depends only on the gap time between sales. Consequently, the
time of sale is uninformative for the underlying series; only the gap time is
required. As a result, the autoregressive series $u_{i,j,z}$ where $i$ and $z$ are fixed
and $j \geq 1$ is a Markov process.

The autoregressive component adds two important features to the model. 
Intuitively, the longer the gap time between sales, the less useful the previous 
price should become when predicting the next sale price. For the model 
described in (1), as the gap time increases, the autoregressive coefficient 
decreases by construction ($\phi^{\gamma(i,j,z)}$), meaning that sales prices of a home
with long gap times are less correlated with each other. (See Remark 3.1 at
the end of this section for additional discussion on the form of $\phi$.) Moreover,
as the gap time increases, the variance of the error term increases. This 
indicates that the information contained in the previous sale price is less 
useful as the time between sales grows. 
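
Both effects can be checked numerically. The following sketch evaluates the autoregressive coefficient and the conditional error variance for several gap times and simulates the adjusted log prices of one house; the function names are assumptions made for illustration, and the parameter values are the Ann Arbor, MI estimates reported in Table 4.

```python
import random


def gap_correlation(phi, gap):
    """Autoregressive coefficient phi**gap linking two sales `gap` periods apart."""
    return phi ** gap


def error_variance(sigma2_eps, phi, gap=None):
    """Error variance in model (1).

    gap=None gives the marginal variance used for a first sale; otherwise the
    variance conditional on a previous sale `gap` periods earlier.
    """
    marginal = sigma2_eps / (1.0 - phi ** 2)
    if gap is None:
        return marginal
    return marginal * (1.0 - phi ** (2 * gap))


def simulate_adjusted_prices(phi, sigma2_eps, gaps, rng):
    """Simulate u_{i,1,z}, u_{i,2,z}, ... for one house given its gap times."""
    u = [rng.gauss(0.0, error_variance(sigma2_eps, phi) ** 0.5)]
    for gap in gaps:
        sd = error_variance(sigma2_eps, phi, gap) ** 0.5
        u.append(gap_correlation(phi, gap) * u[-1] + rng.gauss(0.0, sd))
    return u


phi, sigma2_eps = 0.993247, 0.001567  # Ann Arbor, MI (Table 4)
for gap in (1, 4, 22, 60):
    # The coefficient falls and the conditional variance rises with the gap.
    print(gap, round(gap_correlation(phi, gap), 4),
          round(error_variance(sigma2_eps, phi, gap), 5))

# One simulated house with three sales, 8 and then 30 quarters apart.
path = simulate_adjusted_prices(phi, sigma2_eps, gaps=[8, 30], rng=random.Random(0))
```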

To fit the model, we formulate the autoregressive model in (1) in matrix 
form: 

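$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\tau} + \boldsymbol{\varepsilon}^{*}, \tag{2}
$$

where $\mathbf{y}$ is the vector of log prices and $\mathbf{X}$ and $\mathbf{Z}$ are the design matrices for the
fixed effects $\boldsymbol{\beta} = [\mu \ \beta_1 \cdots \beta_{T-1}]'$ and random effects $\boldsymbol{\tau}$, respectively. Then,
the log price can be modeled as a mixed effects model with autocorrelated
errors, $\boldsymbol{\varepsilon}^{*}$, and with covariance matrix $\mathbf{V}$.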

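We apply a transformation matrix $\mathbf{T}$ to the model in (2) to simplify the
computations; essentially, this matrix applies the autoregressive component
of the model to both sides of (2). It is an $N \times N$ matrix and is defined as
follows. Let $t_{(i,j,z),(i',j',z')}$ be the cell corresponding to the $(i,j,z)$th row and
$(i',j',z')$th column. Then,

$$
t_{(i,j,z),(i',j',z')} =
\begin{cases}
1, & \text{if } i = i',\ j = j',\ z = z', \\
-\phi^{\gamma(i,j,z)}, & \text{if } i = i',\ j = j' + 1,\ z = z', \\
0, & \text{otherwise.}
\end{cases}
\tag{3}
$$

As a result, $\mathbf{T}\boldsymbol{\varepsilon}^{*} \sim \mathcal{N}\bigl(\mathbf{0}, \frac{\sigma_\varepsilon^2}{1-\phi^2} \operatorname{diag}(\mathbf{r})\bigr)$ where $\operatorname{diag}(\mathbf{r})$ is a diagonal matrix of
dimension $N$ with the diagonal elements $\mathbf{r}$ being given by

$$
r_{i,j,z} =
\begin{cases}
1, & \text{when } j = 1, \\
1 - \phi^{2\gamma(i,j,z)}, & \text{when } j > 1.
\end{cases}
\tag{4}
$$

Using the notation from (1), let $\boldsymbol{\varepsilon} = \mathbf{T}\boldsymbol{\varepsilon}^{*}$. Finally, we restrict $\sum_{t=1}^{T} n_t \beta_t = 0$
where $n_t$ is the number of sales at time $t$. Therefore, $\beta_T = -\frac{1}{n_T} \sum_{t=1}^{T-1} n_t \beta_t$.
The likelihood function for the transformed model is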

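$$
L(\boldsymbol{\theta}; \mathbf{y}) = (2\pi)^{-N/2} \, |\mathbf{V}|^{-1/2}
\exp\Bigl\{ -\tfrac{1}{2} \bigl(\mathbf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\bigr)' \mathbf{V}^{-1} \bigl(\mathbf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\bigr) \Bigr\},
\tag{5}
$$

where $\boldsymbol{\theta} = \{\boldsymbol{\beta}, \sigma_\varepsilon^2, \sigma_\tau^2, \phi\}$ is the vector of parameters, $N$ is the total number
of observations, $\mathbf{V}$ is the covariance matrix, and $\mathbf{T}$ is the transformation
matrix. We can split $\mathbf{V}$ into a sum of the variance contributions from the
time series and the random effects. Specifically,

$$
\mathbf{V} = \frac{\sigma_\varepsilon^2}{1-\phi^2} \operatorname{diag}(\mathbf{r}) + (\mathbf{T}\mathbf{Z}) \mathbf{D} (\mathbf{T}\mathbf{Z})', \tag{6}
$$

where $\mathbf{D} = \sigma_\tau^2 \mathbf{I}_Z$ and $\mathbf{I}_Z$ is an identity matrix with dimension $Z \times Z$.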
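
To make the transformation concrete, the sketch below builds $\mathbf{T}$ and the diagonal $\mathbf{r}$ from (3) and (4) for a handful of houses. It is a small dense illustration only: the input layout (one list of sale periods per house, observations ordered house by house) and the function names are assumptions, and a data set of realistic size would call for a sparse representation.

```python
def build_transformation_matrix(houses, phi):
    """Build T and r for observations ordered house by house, sale by sale.

    `houses` is a list of lists; each inner list holds the sale periods
    t(i,1,z), t(i,2,z), ... of one house (hypothetical input layout).
    Returns (T, r): T as a dense list of rows, r as the diagonal in (4).
    """
    n = sum(len(times) for times in houses)
    T = [[0.0] * n for _ in range(n)]
    r = []
    row = 0
    for times in houses:
        for j, t in enumerate(times):
            T[row][row] = 1.0                      # i = i', j = j', z = z'
            if j == 0:
                r.append(1.0)                      # first sale of the house
            else:
                gap = t - times[j - 1]             # gamma(i, j, z)
                T[row][row - 1] = -(phi ** gap)    # j = j' + 1
                r.append(1.0 - phi ** (2 * gap))
            row += 1
    return T, r


def matvec(M, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row_, v)) for row_ in M]


# Three houses with 2, 1 and 3 sales respectively.
houses = [[3, 10], [5], [1, 2, 40]]
T, r = build_transformation_matrix(houses, phi=0.99)
```

Applying `matvec(T, u)` to a vector of adjusted log prices returns $u_{i,1,z}$ for first sales and $u_{i,j,z} - \phi^{\gamma(i,j,z)} u_{i,j-1,z}$ for repeat sales, which is the sense in which $\mathbf{T}$ applies the autoregressive component.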

We use the coordinate ascent algorithm to compute the maximum likelihood
estimates (MLE) of $\boldsymbol{\theta}$ for the model in (1). This iterative procedure
maximizes the likelihood function with respect to each group of parameters
while holding all other parameters constant. The algorithm terminates when
the parameter estimates have converged according to the specified stopping
rule. Bickel and Doksum (2001) include a proof showing that, for models in
the exponential family, the estimates computed using the coordinate ascent
algorithm converge to the MLE. The proposed model, however, is a member
of the differentiable exponential family; therefore, as Brown (1986) states,
the proof does not directly apply. Nonetheless, we find empirically that the
likelihood function is well behaved, so the MLE appears to be reached for
this model as well. Empirical evidence of convergence can be found in the
supplemental article [Nagaraja, Brown and Zhao (2010)].

We outline Algorithm 1 below. The equations for updating the parameters 
and random effects estimates are given in Appendix B. 
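
The control flow of coordinate ascent, including a stopping rule of the kind used in Algorithm 1, can be sketched on a toy problem. The objective below is a simple concave quadratic chosen so that each coordinate update has a closed form; it is not the house price likelihood, and the update equations of Appendix B are not reproduced here. All names are hypothetical.

```python
def coordinate_ascent(update_fns, theta0, tol=1e-10, max_iter=1000):
    """Cycle through coordinate-wise maximizers until no parameter moves more than `tol`.

    `update_fns[name](theta)` must return the value of parameter `name` that
    maximizes the objective with all other entries of `theta` held fixed.
    """
    theta = dict(theta0)
    for iteration in range(1, max_iter + 1):
        previous = dict(theta)
        for name, update in update_fns.items():
            theta[name] = update(theta)  # uses the freshest values of the others
        if all(abs(theta[k] - previous[k]) <= tol for k in theta):
            return theta, iteration
    raise RuntimeError("coordinate ascent did not converge")


# Toy concave objective (an assumption for illustration):
#   f(a, b) = -(a**2 + a*b + b**2) + 3*a + 3*b, maximized at a = b = 1.
updates = {
    "a": lambda th: (3.0 - th["b"]) / 2.0,  # solves df/da = 0 for fixed b
    "b": lambda th: (3.0 - th["a"]) / 2.0,  # solves df/db = 0 for fixed a
}
estimate, n_iter = coordinate_ascent(updates, {"a": 0.0, "b": 0.0})
```

In the model fitting setting each entry of `update_fns` would correspond to one parameter group, with the update obtained by solving the relevant score equation rather than from a closed form.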

To predict a log price, we substitute the estimated parameters and random
effects into (1):

$$
\hat{y}_{i,j,z} = \hat{\mu} + \hat{\beta}_{t(i,j,z)} + \hat{\tau}_z + \hat{\phi}^{\gamma(i,j,z)} \bigl( y_{i,j-1,z} - \hat{\mu} - \hat{\beta}_{t(i,j-1,z)} - \hat{\tau}_z \bigr). \tag{7}
$$

We then convert $\hat{y}_{i,j,z}$ to the price scale (denoted as $\hat{Y}_{i,j,z}$) using

$$
\hat{Y}_{i,j,z}(\sigma^2) = \exp\Bigl\{ \hat{y}_{i,j,z} + \frac{\sigma^2}{2} \Bigr\}, \tag{8}
$$

where $\sigma^2$ denotes the variance of $y_{i,j,z}$. The additional term $\sigma^2/2$ approximates
the difference between $E[\exp\{X\}]$ and $\exp\{E[X]\}$ where $E[\cdot]$ is the
expectation function. We must adjust the latter expression to approximate
the conditional mean of the response, $y$. We improve the efficiency of our
estimates by using the adjustment stated in Shen, Brown and Zhi (2006).
In (8), $\sigma^2$ can be estimated from the mean squared residuals (MSR), where
$\mathrm{MSR} = \frac{1}{N} \sum_{i=1}^{N} (y_{i,j,z} - \hat{y}_{i,j,z})^2$ and $N$ is the total number of observations
used to fit the model. Therefore, the log price estimates, $\hat{y}_{i,j,z}$, are converted
to the price scale by

$$
\hat{Y}_{i,j,z} = \exp\Bigl\{ \hat{y}_{i,j,z} + \frac{\mathrm{MSR}}{2} \Bigr\}. \tag{9}
$$

Goetzmann (1992) proposes a similar transformation for the index values
computed using a traditional repeat sales method. Calhoun (1996) suggests
applying Goetzmann's adjustment when using an index value to predict a
particular house price. For the autoregressive model, the standard error of
the index is sufficiently small that the efficiency adjustment has a negligible
impact on the estimated index. Therefore, we simply use $\exp\{\hat{\beta}_t\}$ to convert
the index to the price scale. Finally, we rescale the vector of indices so that
the first quarter has an index value of 1.
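
The prediction and back-transformation steps can be sketched as below. The function names and the dictionary layout for the log index are assumptions. The fallback for a first sale (predicting with the mean structure alone) is also an assumption of this sketch, since (7) is stated for repeat sales.

```python
import math


def predict_log_price(mu, beta, tau_z, phi, t_now, t_prev=None, y_prev=None):
    """Equation (7) for a repeat sale.

    `beta` maps period -> estimated log index; all inputs are estimates.
    With no previous sale, returns mu + beta[t_now] + tau_z (an assumption).
    """
    base = mu + beta[t_now] + tau_z
    if t_prev is None:
        return base
    gap = t_now - t_prev
    return base + phi ** gap * (y_prev - mu - beta[t_prev] - tau_z)


def to_price_scale(log_pred, msr):
    """Equation (9): add half the mean squared residual before exponentiating."""
    return math.exp(log_pred + msr / 2.0)


def rescale_index(beta):
    """Convert log indices to the price scale with the first period set to 1."""
    first = min(beta)
    return {t: math.exp(b - beta[first]) for t, b in beta.items()}


# Made-up values for illustration only.
beta_hat = {1: 0.00, 2: 0.10}
log_pred = predict_log_price(11.0, beta_hat, 0.2, 0.99, t_now=2,
                             t_prev=1, y_prev=11.7)
price_pred = to_price_scale(log_pred, msr=0.04)
index = rescale_index(beta_hat)
```

Dividing by the first-period value after exponentiating is the same as subtracting the first log index before exponentiating, which is what `rescale_index` does.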

Algorithm 1 Autoregressive (AR) model fitting algorithm.

1. Set a tolerance level $\epsilon$ (possibly different for each parameter).

2. Initialize the parameters: $\theta^{0} = \{\beta^{0}, \sigma_\varepsilon^{2,0}, \sigma_\tau^{2,0}, \phi^{0}\}$.

3. For iteration $k$ ($k = 0$ when the parameters are initialized),

   (a) Calculate $\beta^{k}$ using (19) in Appendix B with $\{\sigma_\varepsilon^{2,k-1}, \sigma_\tau^{2,k-1}, \phi^{k-1}\}$.

   (b) Compute $\sigma_\varepsilon^{2,k}$ by computing the zero of (20) using $\{\beta^{k}, \sigma_\tau^{2,k-1}, \phi^{k-1}\}$.

   (c) Compute $\sigma_\tau^{2,k}$ by calculating the zero of (21) using $\{\beta^{k}, \sigma_\varepsilon^{2,k}, \phi^{k-1}\}$.

   (d) Find the zero of (22) to compute $\phi^{k}$ using $\{\beta^{k}, \sigma_\varepsilon^{2,k}, \sigma_\tau^{2,k}\}$.

   (e) If $|\theta_i^{k-1} - \theta_i^{k}| > \epsilon$ for any $\theta_i \in \theta$, repeat step 3 after replacing $\theta^{k-1}$
       with $\theta^{k}$. Otherwise, stop (call this iteration $K$).

4. Solve for $\beta_T$ by computing: $\hat{\beta}_T = -\frac{1}{n_T} \sum_{t=1}^{T-1} n_t \hat{\beta}_t^{K}$.

5. Plug in $\{\beta^{K}, \sigma_\varepsilon^{2,K}, \sigma_\tau^{2,K}, \phi^{K}\}$ to compute the estimated values for $\tau$ using
   (23).

Remark 3.1. The autoregressive coefficient form, $\phi^{\gamma(i,j,z)}$, deserves further
explanation. For each house indexed by $(i,z)$, let $t_1(i,z) = t(i,1,z)$
denote the time of the initial sale. Conditioning on the (unobserved) values
of the parameters $\{\mu, \beta, \sigma_\varepsilon^2, \sigma_\tau^2\}$ and on the values of the random ZIP code
effects, $\{\tau_z\}$, let $\{u_{i,z;t} : t = t_1(i,z), t_1(i,z) + 1, \ldots\}$ be an underlying AR(1)
process. To be more precise, $u_{i,z;t}$ is a conventional, stationary AR(1) process
defined by

$$
u_{i,z;t} =
\begin{cases}
\varepsilon_{i,1,z}, & \text{if } t = t_1(i,z), \\
\phi \, u_{i,z;t-1} + \varepsilon_{i,z;t}, & \text{if } t > t_1(i,z),
\end{cases}
\tag{10}
$$

where if $t = t(i,j,z)$, then $\varepsilon_{i,z;t(i,j,z)} = \varepsilon_{i,j,z}$ and otherwise $\varepsilon_{i,z;t} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$.
Then the observed log sale prices are given by $\{y_{i,j,z}\}$ where $u_{i,z;t(i,j,z)} =
y_{i,j,z} - (\mu + \beta_{t(i,j,z)} + \tau_z)$. The values of $u_{i,z;t}$ are to be interpreted as the
potential sale price adjusted by $\{\mu, \beta_t, \sigma_\varepsilon^2, \sigma_\tau^2\}$ of the house indexed by $(i,z)$
if the house were to be sold at time $t$.

For housing data like ours, the value of the autoregressive parameter $\phi$
for this latent process will be near the largest possible value, $\phi = 1$. Consequently,
if the underlying process were actually an observed process from
which one wanted to estimate $\phi$, then estimation of $\phi$ could be a delicate
matter. However, sales generally occur with fairly large gap times and so
the values of $\phi^{\gamma(i,j,z)}$ occurring in the data will generally not be close to 1.
For that reason, conventional estimation procedures perform satisfactorily
when estimating $\phi$. We provide empirical evidence for this in Section 4 and
in the supplemental article [Nagaraja, Brown and Zhao (2010)].

4. Estimation results. To fit and validate the autoregressive (AR) model, 
we divide the observations for each city into training and test sets. The test 
set contains all final sales for homes that sell three or more times. Among 
homes that sell twice, the second sale is added to the test set with probability 
1/2. As a result, the test set for each city contains roughly 15% of the sales. 
The remaining sales (including single sales) comprise the training set. Table 
8 in Appendix A lists the training and test set sizes for each city. We fit 
the model on the training set and examine the estimated parameters. The 
test set will be used in Section 5 to validate the AR model against two 
alternatives. 
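
The splitting rule just described can be sketched as follows. The input layout (a mapping from house identifier to its chronologically ordered sales) and the function name are assumptions made for illustration.

```python
import random


def split_sales(sales_by_house, rng):
    """Split sales into training and test sets following the rule in Section 4.

    Houses with three or more sales contribute their final sale to the test
    set; houses with exactly two sales do so with probability 1/2; all other
    sales (including single sales) go to the training set.
    """
    train, test = [], []
    for house, sales in sales_by_house.items():
        n = len(sales)
        hold_out = n >= 3 or (n == 2 and rng.random() < 0.5)
        cut = n - 1 if hold_out else n
        train.extend((house, s) for s in sales[:cut])
        test.extend((house, s) for s in sales[cut:])
    return train, test


# Sale periods per house (made-up values).
sales_by_house = {"A": [12], "B": [3, 20, 41], "C": [7, 30]}
train, test = split_sales(sales_by_house, random.Random(1))
```

Because only final sales are held out, every test sale has at least one earlier sale of the same house in the training set, so the autoregressive prediction in (7) can always be formed.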

In Table 4, the estimates for the overall mean $\mu$ (on the log scale), the
autoregressive parameter $\phi$, the variance of the error term $\sigma_\varepsilon^2$, and the
variance of the random effects $\sigma_\tau^2$ are provided for each metropolitan area. As
expected, the most expensive cities have the highest values of $\mu$: Los Angeles,
CA, San Francisco, CA, and Stamford, CT. In Figure 2, the indices for
a sample of the twenty cities are provided. There are clearly different trends 
across cities. 

Table 4
Parameter estimates for the AR model

Metropolitan area     $\hat{\mu}$    $\hat{\phi}$    $\hat{\sigma}_\varepsilon^2$    $\hat{\sigma}_\tau^2$
Ann Arbor, MI         11.6643    0.993247    0.001567    0.110454
Atlanta, GA           11.6882    0.992874    0.001651    0.070104
Chicago, IL           11.8226    0.992000    0.001502    0.110683
Columbia, SC          11.3843    0.997526    0.000883    0.028062
Columbus, OH          11.5159    0.994807    0.001264    0.090329
Kansas City, MO       11.4884    0.993734    0.001462    0.121954
Lexington, KY         11.6224    0.996236    0.000968    0.048227
Los Angeles, CA       12.1367    0.981888    0.002174    0.111708
Madison, WI           11.7001    0.994318    0.001120    0.023295
Memphis, TN           11.6572    0.994594    0.001120    0.101298
Minneapolis, MN       11.8327    0.992008    0.001515    0.050961
Orlando, FL           11.6055    0.993561    0.001676    0.046727
Philadelphia, PA      11.7106    0.991767    0.001679    0.183495
Phoenix, AZ           11.7022    0.992349    0.001543    0.106971
Pittsburgh, PA        11.3408    0.992059    0.002546    0.103488
Raleigh, NC           11.7447    0.993828    0.001413    0.047029
San Francisco, CA     12.4236    0.985644    0.001788    0.056201
Seattle, WA           11.9998    0.989923    0.001658    0.039459
Sioux Falls, SD       11.6025    0.995262    0.001120    0.032719
Stamford, CT          12.5345    0.987938    0.002294    0.093230

The estimates for the AR model parameter $\phi$ are close to one. This is not
surprising as the adjusted log sale prices, $u_{i,j,z}$, for sale pairs with short gap
times are expected to be closer in value than those with longer gap times.
It may be tempting to assume that since $\phi$ is so close to 1, the prices form
a random walk instead of an AR(1) time series (see Remark 3.1). However,
this is clearly not the case. Recall that $\phi$ enters the model not by itself but
as $\phi^{\gamma(i,j,z)}$ where $\gamma(i,j,z)$ is the gap time. These gap times are high enough
that the correlation coefficient $\phi^{\gamma(i,j,z)}$ is considerably lower than 1. The
mean gap time across cities is around 22 quarters. As an example, for Ann
Arbor, MI, $\hat{\phi}^{22} = 0.993247^{22} \approx 0.8615$ which is clearly less than 1. Therefore,
the types of sensitivity often produced as a consequence of near unit roots
do not apply to our autoregressive model.

Fig. 3. Checking the AR(1) assumption for Columbus, OH.

We have modeled the adjusted log prices, $u_{i,j,z} = y_{i,j,z} - \mu - \beta_{t(i,j,z)} - \tau_z$, as
a latent AR(1) time series. Accordingly, for each gap time, $\gamma(i,j,z) = h$,
there is an expected correlation between the sale pairs: $\phi^{h}$. To check that
the data support the theory, we compare the correlation between pairs of
quarter-adjusted log prices at each gap length to the correlation predicted
by the model.

First, we compute the estimated adjusted log prices $\hat{u}_{i,j,z} = y_{i,j,z} - \hat{\mu} - \hat{\beta}_{t(i,j,z)} - \hat{\tau}_z$
for the training data. Next, for each gap time $h$, we find all the sale pairs
$(\hat{u}_{i,j-1,z}, \hat{u}_{i,j,z})$ with that particular gap length. The sample correlation between
those sale pairs produces an estimate of $\phi^{h}$ for gap length $h$. If we repeat
this procedure for each possible gap length, we should obtain a steady
decrease in the correlation as gap time increases. In particular, the points
should follow the curve $\phi^{h}$ if the model is specified correctly.
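
The diagnostic described above amounts to grouping sale pairs by gap length and computing a sample correlation within each group. A minimal sketch follows, assuming the pairs are supplied as `(gap, u_prev, u_curr)` tuples of estimated adjusted log prices; that layout and the function names are hypothetical.

```python
from collections import defaultdict
import math


def pearson(xs, ys):
    """Sample Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def correlation_by_gap(pairs, min_pairs=3):
    """Return {gap: (sample correlation, number of pairs)} for each gap length.

    Gaps with fewer than `min_pairs` sale pairs are skipped.
    """
    grouped = defaultdict(list)
    for gap, u_prev, u_curr in pairs:
        grouped[gap].append((u_prev, u_curr))
    result = {}
    for gap, items in sorted(grouped.items()):
        if len(items) >= min_pairs:
            prev, curr = zip(*items)
            result[gap] = (pearson(prev, curr), len(items))
    return result


def model_correlation(phi, gap):
    """Correlation implied by the AR(1) model for a given gap length."""
    return phi ** gap


# Tiny made-up input, only to show the call pattern.
pairs = [(1, 1.0, 2.0), (1, 2.0, 4.0), (1, 3.0, 6.0),
         (2, 1.0, 3.0), (2, 2.0, 2.0), (2, 3.0, 1.0)]
by_gap = correlation_by_gap(pairs)
```

Keeping the pair count alongside each correlation makes it easy to flag gap lengths supported by few pairs when plotting the estimates against `model_correlation`.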

In Figure 3, we plot the correlation of the adjusted log prices by gap time 
for Columbus, OH. Note that the correlations for each gap time
were computed from varying numbers of sale pairs. Those computed with
fewer than twenty sale pairs are plotted as blue triangles. We also overlay
the predicted relationship between $\phi$ and gap time. The inverse relationship
between gap time and correlation seems to hold well and we obtain similar 
results for most cities. One notable exception is Los Angeles, CA, which we 
discuss in Section 6. 

5. Model validation. To show that the proposed AR model produces 
good predictions, we fit the model separately to each of the twenty cities and 
apply the fitted models to each test set. For comparison purposes, a mixed
effects model along with the benchmark S&P/Case-Shiller model is applied 
to the data. The former model is a simple, but reasonable, alternative to the 
AR model. Both models are described below. In addition to the predictions, 
we compare the price indices and training set residuals. 

The root mean squared error (RMSE)[2] is used to evaluate predictive
performance for each city in Section 5.3. We will see that the AR model 
provides the best predictions. In addition, we will show the results from 
Columbus, OH as a typical example. 
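
For reference, the RMSE criterion can be computed as in the sketch below. The function name is an assumption, and the inputs are taken to be observed and predicted values for the same test-set sales, in the same order.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / len(observed))


# Made-up prices for illustration.
example = rmse([200_000.0, 150_000.0], [190_000.0, 165_000.0])
```

Because the errors are squared before averaging, a few badly predicted houses can dominate the criterion, which is worth keeping in mind when comparing models city by city.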

5.1. Mixed effects model. A mixed effects model provides a very simple,
but plausible, approach for modeling these data. This model treats the time
effect ($\beta_t$) as a fixed effect, and the effects of house ($\alpha_i$) and ZIP code ($\tau_z$)
are modeled as random effects. There is no time series component to this
model. We describe the model as follows:

$$
y_{i,j,z} = \mu + \alpha_i + \tau_z + \beta_{t(i,j,z)} + \varepsilon_{i,j,z}, \tag{11}
$$

where $\alpha_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\alpha^2)$, $\tau_z \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\tau^2)$, and $\varepsilon_{i,j,z} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$ for houses $i$
from $1, \ldots, I_z$, sales $j$ from $1, \ldots, J_i$, and ZIP codes $z$ from $1, \ldots, Z$. As before,
$\mu$ is a fixed parameter and $\beta_{t(i,j,z)}$ is the fixed effect for time. The estimates for
the parameters $\theta = \{\mu, \beta, \sigma_\varepsilon^2, \sigma_\alpha^2, \sigma_\tau^2\}$ are computed using maximum likelihood
estimation.

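Finally, estimates for the random effects $\alpha$ and $\tau$ are calculated by iteratively
calculating the following:

$$
\hat{\boldsymbol{\alpha}} = \Bigl( \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_\alpha^2} \mathbf{I}_I + \mathbf{W}'\mathbf{W} \Bigr)^{-1} \mathbf{W}' \bigl( \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{Z}\hat{\boldsymbol{\tau}} \bigr), \tag{12}
$$

$$
\hat{\boldsymbol{\tau}} = \Bigl( \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_\tau^2} \mathbf{I}_Z + \mathbf{Z}'\mathbf{Z} \Bigr)^{-1} \mathbf{Z}' \bigl( \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{W}\hat{\boldsymbol{\alpha}} \bigr), \tag{13}
$$

where $\mathbf{X}$ is the design matrix for the fixed effects, $\mathbf{W}$ and $\mathbf{Z}$ are the design
matrices for the house and ZIP code random effects, respectively, and $\mathbf{y}$ is the
response vector. These expressions are derived using
the method of computing BLUP estimators outlined by Henderson (1975).
To predict the log price, $\hat{y}_{i,j,z}$, we substitute the estimated values:

$$
\hat{y}_{i,j,z} = \hat{\mu} + \hat{\beta}_{t(i,j,z)} + \hat{\alpha}_i + \hat{\tau}_z. \tag{14}
$$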

We use transformation (9) to convert these predictions back to the price 
scale. Finally, we construct a price index similar to the autoregressive case. 
Therefore, as in Figure 2, the values of $\exp\{\hat{\beta}_t\}$ are rescaled so that the price
index in the first quarter is 1. 
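
When a random-effects design matrix is a plain indicator matrix (one column per group, a single 1 per row), the matrix inverse in (13) reduces to a per-group shrinkage formula, because $\mathbf{Z}'\mathbf{Z}$ is diagonal with the group counts and $\mathbf{Z}'\mathbf{r}$ collects the group sums of the residuals. The sketch below assumes such an indicator design; the function name and inputs are hypothetical.

```python
from collections import defaultdict


def blup_group_effects(residuals, groups, sigma2_eps, sigma2_group):
    """Solve (lambda * I + Z'Z)^{-1} Z'r for an indicator design Z.

    lambda = sigma2_eps / sigma2_group. `residuals` holds r, the response
    minus the fixed effects and the other random effect; `groups` gives the
    group label (for example the ZIP code) of each observation.
    """
    lam = sigma2_eps / sigma2_group
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(residuals, groups):
        sums[g] += r
        counts[g] += 1
    # Each estimate is the group residual sum divided by (lambda + group size).
    return {g: sums[g] / (lam + counts[g]) for g in sums}


# Made-up residuals: three observations in group "a", one in group "b".
effects = blup_group_effects([1.0, 1.0, 1.0, 1.0], ["a", "a", "a", "b"],
                             sigma2_eps=1.0, sigma2_group=1.0)
```

Groups with few observations are pulled more strongly toward zero, which is the usual shrinkage behavior of BLUP estimates. Iterating between (12) and (13) would alternate calls of this kind with the residuals recomputed each time.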



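[2] RMSE $= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$, where $Y_i$ is an observed value, $\hat{Y}_i$ is its prediction, and $n$ is the number of predictions.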
5.2. S&P/Case-Shiller model. The original Case and Shiller (1987, 1989)
model is a repeat-sales model which expands upon the Bailey, Muth and
Nourse (1963) setting by accounting for heteroscedasticity in the data due to
the gap time between sales. Borrowing some of their notation, the framework
for their model is

$$
y_{i,t} = \beta_t + H_{i,t} + u_{i,t}, \tag{15}
$$

where $y_{i,t}$ is the log price of the sale of the $i$th house at time $t$, $\beta_t$ is the log
index at time $t$, and $u_{i,t} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_u^2)$. The middle term, $H_{i,t}$, is a Gaussian
random walk which incorporates the previous log sale price of the house.
Location information, such as ZIP codes, is not included in this model.
Like the Bailey, Muth and Nourse setup, the Case and Shiller setting is a
model for differences in prices. Thus, the following model is fit:

$$
y_{i,t'} - y_{i,t} = \beta_{t'} - \beta_t + \sum_{k=t+1}^{t'} v_{i,k} + u_{i,t'} - u_{i,t}, \tag{16}
$$

where $t' > t$. The random walk steps are normally distributed where $v_{i,k} \overset{\text{i.i.d.}}{\sim}
\mathcal{N}(0, \sigma_v^2)$. Weighted least squares is used to fit the model to account for both
sources of variation.

The S&P/Case-Shiller procedure follows in a similar vein but is fit on 
the price scale instead of the log price scale. The procedure is similar to 
the arithmetic index proposed by Shiller (1991) which we will describe next; 
however, full details are available in the S&P/Case-Shiller® Home Price 
Indices: Index Methodology (2009) report. Let there be $S$ sale pairs, each
consisting of two consecutive sales of the same house, and $T$ time periods. An
$S \times (T-1)$ design matrix $X$, an $S \times (T-1)$ instrumental variables (IV)
matrix $Z$, and an $S \times 1$ response vector $w$ are defined next. Let the subscripts
$s$ and $t$ denote the row and column index respectively. Finally, let $Y_{s,t}$ be
the sale price (not log price) of the house in sale pair $s$ at time $t$. Therefore,
in each sale pair, there will be two prices $Y_{s,t}$ and $Y_{s,t'}$ where $t \neq t'$. The
matrices $X$, $Z$ and vector $w$ are now defined as follows:

$X_{s,t} = \begin{cases} -Y_{s,t}, & \text{if first sale of pair } s \text{ is at time } t,\ t > 1, \\ Y_{s,t}, & \text{if second sale of pair } s \text{ is at time } t, \\ 0, & \text{otherwise,} \end{cases}$

$Z_{s,t} = \begin{cases} -1, & \text{if first sale of pair } s \text{ is at time } t,\ t > 1, \\ 1, & \text{if second sale of pair } s \text{ is at time } t, \\ 0, & \text{otherwise,} \end{cases}$

$w_s = \begin{cases} Y_{s,1}, & \text{if first sale of pair } s \text{ is at time } 1, \\ 0, & \text{otherwise.} \end{cases}$
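
To make the construction of $X$, $Z$ and $w$ concrete, the sketch below builds them from a list of sale pairs. It is an illustration under stated assumptions, not code from the S&P/Case-Shiller methodology: periods are numbered 1 to T, the T - 1 columns are taken to correspond to periods 2 through T (so period t sits at column index t - 2), and the function name `build_case_shiller_matrices` and the tuple layout of a pair are hypothetical.

```python
def build_case_shiller_matrices(pairs, n_periods):
    """Build X, Z (each S x (T-1)) and w (length S) from sale pairs.

    Each pair is (t_first, price_first, t_second, price_second) with
    periods numbered 1..T and t_first < t_second.
    """
    X, Z, w = [], [], []
    for t1, p1, t2, p2 in pairs:
        x_row = [0.0] * (n_periods - 1)
        z_row = [0.0] * (n_periods - 1)
        w_s = 0.0
        if t1 == 1:
            # A first sale in the base period goes into the response vector.
            w_s = p1
        else:
            x_row[t1 - 2] = -p1      # first sale: negative price in X
            z_row[t1 - 2] = -1.0     # first sale: -1 in the instrument matrix
        x_row[t2 - 2] = p2           # second sale: price in X
        z_row[t2 - 2] = 1.0          # second sale: +1 in the instrument matrix
        X.append(x_row)
        Z.append(z_row)
        w.append(w_s)
    return X, Z, w
```

Each row has at most two nonzero entries, so a sparse representation would be the natural choice for data sets of the size described in Appendix A.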

The goal is to fit the model $w = Xb + \varepsilon$ where $b = (b_1 \cdots b_T)'$ is the vector
of the reciprocal price indices. That is, $B_t = 1/b_t$ is the price index at time $t$.
A three-step process is implemented to fit this model. First, b is estimated 
using regression with instrumental variables. Second, the residuals from this 
regression are used to compute weights for each observation. Finally, b is es- 
timated once more while applying the weights. This process, outlined in full 
in the S&P/Case-Shiller® Home Price Indices: Index Methodology report, 
is described below: 

1. Estimate $b$ by running a regression using instrumental variables:
$\hat{b} = (Z'X)^{-1} Z'w$.

2. Calculate the weights for each observation using the squared residuals
from the first step. These weights are dependent on the gap time between
sales. We denote the residual as $\hat{\varepsilon}_i$, which is an estimate of
$\varepsilon_i = u_{i,t'} - u_{i,t} + \sum_{k=t+1}^{t'} v_{i,k}$. The expectation of $\varepsilon_i$ is
$E[u_{i,t'} - u_{i,t} + \sum_{k=t+1}^{t'} v_{i,k}] = 0$ and the variance is
$\mathrm{Var}[u_{i,t'} - u_{i,t} + \sum_{k=t+1}^{t'} v_{i,k}] = 2\sigma_u^2 + (t' - t)\sigma_v^2$. To compute the
weights for each observation, the squared residuals from the first step are
regressed against the gap time. That is,

(17)    $\hat{\varepsilon}_i^2 = a_0 + a_1 (t' - t) + \eta_i,$

where $E[\eta_i] = 0$ and the regression coefficients $a_0$ and $a_1$ estimate $2\sigma_u^2$ and
$\sigma_v^2$ respectively. The reciprocals of the square roots of the fitted values from
the above regression are the weights. Using their notation, we denote this
weight matrix by $\hat{\Omega}^{-1}$.

3. The final step is to estimate $b$ again while incorporating the weights, $\hat{\Omega}$:
$\hat{b} = (Z'\hat{\Omega}^{-1}X)^{-1} Z'\hat{\Omega}^{-1}w$. The indices are simply the reciprocals of each
element in $\hat{b}$ for $t > 1$ and, by construction, $B_1 = 1$.
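
The three steps above can be sketched with NumPy as follows. This is a hedged illustration, not the official S&P/Case-Shiller code: it assumes that weighting each observation by the reciprocal square root of its fitted variance is equivalent to using a diagonal matrix of fitted variances as the weight matrix in the weighted instrumental-variables formula, and the function name `case_shiller_index` and its arguments are hypothetical. The explicit check on the fitted values mirrors the situation in which non-positive fitted variances leave the weights undefined.

```python
import numpy as np

def case_shiller_index(X, Z, w, gaps):
    """Three-step index estimate; returns index values for periods 1..T."""
    X, Z, w, gaps = (np.asarray(a, dtype=float) for a in (X, Z, w, gaps))
    # Step 1: instrumental-variables estimate b = (Z'X)^{-1} Z'w.
    b1 = np.linalg.solve(Z.T @ X, Z.T @ w)
    # Step 2: regress the squared residuals on the gap time.
    resid_sq = (w - X @ b1) ** 2
    G = np.column_stack([np.ones_like(gaps), gaps])
    coef, *_ = np.linalg.lstsq(G, resid_sq, rcond=None)
    fitted = G @ coef
    if np.any(fitted <= 0):
        raise ValueError("non-positive fitted variance; weights are undefined")
    # Step 3: weighted IV estimate, with the fitted variances on the diagonal.
    omega_inv = np.diag(1.0 / fitted)
    b2 = np.linalg.solve(Z.T @ omega_inv @ X, Z.T @ omega_inv @ w)
    # The index is the reciprocal of b, with the base period fixed at 1.
    return np.concatenate([[1.0], 1.0 / b2])
```

Forming the dense diagonal matrix is wasteful for large S; scaling the rows of X and w by the weights gives the same estimate with far less memory.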

Finally, to estimate the prices in the test set, we simply calculate

(18)    $\hat{Y}_{i,j} = \frac{\hat{B}_{t(i,j)}}{\hat{B}_{t(i,j-1)}} Y_{i,j-1},$

where $Y_{i,j}$ is the price of the $j$th sale of the $i$th house, $B_t$ is the price
index at time $t$, and $t(i,j)$ is the time of that sale. We do not apply the
correction proposed by Goetzmann when estimating prices because it is
appropriate only for predictions on the log price scale. The S&P/Case-Shiller
method is fit on the price scale so no transformation is required.
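
As a small illustration of this prediction rule, as we read it, the sketch below scales the previous sale price by the ratio of the index values at the two sale dates. The function name `predict_price` and the dictionary layout of the index are assumptions made for this example only.

```python
def predict_price(prev_price, index, t_prev, t_now):
    """Predict a sale price from the previous sale and a price index.

    index maps a time period to its index value (base period = 1).
    """
    return prev_price * index[t_now] / index[t_prev]


# Hypothetical index values for three quarters.
example_index = {1: 1.00, 2: 1.10, 3: 1.21}
example_prediction = predict_price(200000.0, example_index, t_prev=2, t_now=3)
```

Because the index is already on the price scale, no back-transformation from logs is involved in this step.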

5.3. Comparing predictions. We fit all three models on the training sets 
for each city and predict prices for those homes in the corresponding test set. 
The RMSE for the test set observations is calculated in dollars for each model 
in order to compare performance across models. These results are listed in 
Table 5. The model with the lowest RMSE value for each city is shown
in italicized font. Note that while the S&P/Case-Shiller method produces 
predictions directly on the price scale, the autoregressive and mixed effects 
models must be converted back to the price scale using (9). It is clear that 
the AR model performs better than the S&P/Case-Shiller model for all of 
the cities, reducing the RMSE by up to 21% in some cases; the AR model 
produces lower RMSE values when compared to the mixed effects model 
as well for nearly all cities, San Francisco, CA, being the only exception. 
Moreover, the AR model performs better under alternate loss functions as 
well, which we show in the supplemental article [Nagaraja, Brown and Zhao 
(2010)]. 
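
The comparison metric is simple to compute. The sketch below gives a plain RMSE function and the relative reduction in RMSE between two models; the function names are ours. Applied to the Ann Arbor, MI entries of Table 5 (41,401 for the AR model and 52,718 for the S&P/Case-Shiller model), the reduction comes to roughly 21%.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted prices."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def pct_reduction(rmse_new, rmse_ref):
    """Percentage by which rmse_new improves on rmse_ref."""
    return 100.0 * (rmse_ref - rmse_new) / rmse_ref


# Ann Arbor, MI values taken from Table 5.
ann_arbor_gain = pct_reduction(41401, 52718)
```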

Note that the RMSE value is missing for Kansas City, MO for the S&P/Case-Shiller
model. Some of the observation weights calculated in the second step
of the procedure were negative, halting the estimation process. This is
another drawback to some of the existing repeat sales procedures. Calhoun
(1996) suggests replacing the sale specific error $u_{i,t}$ [as given in (16)] with
a house specific error $u_i$; however, this fundamentally changes the structure
of the error term and, as a result, the fitting process. Furthermore, it is not 
implemented in the S&P/Case-Shiller methodology. Therefore, we do not 
apply it to our data. 



Table 5
Test set RMSE for three models (in dollars)

Metropolitan area     AR (local)  Mixed effects (local)   S&P/C-S
Ann Arbor, MI             41,401                 46,519    52,718
Atlanta, GA               30,914                 34,912    35,482
Chicago, IL               36,004                      -    42,865
Columbia, SC              35,881                 38,375    42,301
Columbus, OH              27,353                 30,163    30,208
Kansas City, MO           24,179                 25,851         -
Lexington, KY             21,132                 21,555    21,731
Los Angeles, CA           37,438                      -    41,951
Madison, WI               28,035                 30,297    30,640
Memphis, TN               24,588                 25,502    25,267
Minneapolis, MN           31,900                 34,065    34,787
Orlando, FL               28,449                 30,438    30,158
Philadelphia, PA          33,246                      -    35,350
Phoenix, AZ               28,247                 29,286    29,350
Pittsburgh, PA            26,406                 28,630    30,135
Raleigh, NC               25,839                 27,493    26,775
San Francisco, CA         49,927                 48,217    50,249
Seattle, WA               38,469                 41,950    43,486
Sioux Falls, SD           20,160                 21,171    21,577
Stamford, CT              57,722                 58,616    68,132

Fig. 4. Comparing the variance of the residuals for Columbus, OH.



Three values are also missing in Table 5 for the mixed effect model results. 
For these three cities, the iterative fitting procedure failed to converge. We 
can attribute this to the size of these data and, more importantly, to the fact
that the data do not conform well to the mixed effects model structure.

Next, we will examine several diagnostic plots to assess whether the model 
assumptions are satisfied for each method. We begin by investigating the 
variance of the residuals. As the gap time increases, we expect a higher 
error variance indicating that the previous price becomes less useful over 
time. The proposed autoregressive model and the S&P/Case-Shiller model 
each incorporate this feature differently, using an underlying AR(1) time 
series and a random walk respectively. The mixed effects model, however, 
assumes a constant variance regardless of gap time. In Figure 4, for each
model, we plot the variance of the training set residuals by gap time.^ The
expected variance by gap time, computed using the estimated parameters, is
then overlaid. The autoregressive and mixed effects models are
fit on the log price scale, whereas the S&P/Case-Shiller model is fit on the
price scale. Therefore, the residual plots are graphed on very different scales.

Fig. 5. Normality of ZIP code effects for Columbus, OH. [Panels: Normal Q-Q Plot (AR Model); Normal Q-Q Plot (Mixed Effects Model).]

There are two features to note here. The first is that heteroscedasticity 
is clearly present: the variance of the residuals does in fact increase with 
gap time. The second feature is that while none of the methods perfectly 
model the heteroscedastic error, the mixed effects model is undoubtedly 
the worst. This pattern holds across all of the cities in the data set. Both 
the autoregressive and S&P/Case-Shiller models seem to have lower than 
expected variances in Figure 4. 
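
The diagnostic described above amounts to grouping residuals by gap time and comparing the empirical variance in each group with the variance implied by the model. The sketch below shows the grouping step and, for the repeat sales setting, the expected variance $2\sigma_u^2 + (t' - t)\sigma_v^2$ given in Section 5.2. The function names are hypothetical, and the population variance is used for simplicity.

```python
from collections import defaultdict
from statistics import pvariance

def variance_by_gap(residuals, gaps):
    """Empirical residual variance for each gap time with at least two residuals."""
    groups = defaultdict(list)
    for r, g in zip(residuals, gaps):
        groups[g].append(r)
    return {g: pvariance(v) for g, v in sorted(groups.items()) if len(v) > 1}

def random_walk_variance(gap, var_u, var_v):
    """Variance of a sale-pair error under the random walk model of Section 5.2."""
    return 2 * var_u + gap * var_v
```

Plotting the output of `variance_by_gap` against gap time and overlaying `random_walk_variance` for the fitted variance parameters gives a plot of the kind summarized in Figure 4 for the repeat sales model.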

For both the AR and mixed effects models, the random effects for ZIP 
codes are assumed to be normally distributed. As a diagnostic procedure, 
we construct the normal quantile plots of the ZIP code effects. The results 
are shown in Figure 5. Columbus, OH has a total of 103 ZIP codes, or 
random effects. We find the normality assumption appears to be reasonably 
satisfied for the mixed effects model but less so for the autoregressive model. 
Note, however, that each random effect is estimated using a different number 
of sales. This interferes with the routine interpretation of these plots. In 
particular, the outliers in both plots correspond to ZIP codes containing
ten or fewer sales. Across all metropolitan areas, the normality assumption
seems to be well satisfied in some cases and not so well in others, but with
no clear pattern we could discern as to the type of analysis, size of the
data or geographic region. The supplemental article contains results of the
Shapiro-Wilk test for normality [Nagaraja, Brown and Zhao (2010)].

^Note that for these three plots, the term "residual" indicates the usual statistical
residual values produced by applying the model and comparing the predictions with the
response vector. For the AR and mixed effects models, these residuals are identical to the
predictions on the log price scale discussed in previous sections; however, for the S&P/C-S
model, this is not the case.

Fig. 6. House price indices for Columbus, OH. [Horizontal axis: quarter, July 1985 to July 2004.]

In Figure 6, we plot four indices for Columbus, OH: the AR index, the 
mixed effects index, the S&P/Case-Shiller index, and the mean price index. 
The mean index is simply the average sale price at each quarter rescaled so 
that the first index value is 1. From the plot, we see that the autoregressive 
index is generally between the S&P/Case-Shiller index and the mean index 
at each point in time. The mean index treats all sales as single sales. That 
is, information about repeat sales is not included; in fact, no information 
about house prices is shared across quarters. The S&P/Case-Shiller index, 
on the other hand, only includes repeat sales houses. The autoregressive 
model, because it includes both single sales and repeat sales, is a mixture of 
the two perspectives. Essentially, the index constructed from the proposed 
model is a measure of the average house price placing more weight to those 
homes which have sold more than once. 
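
The mean index used as a reference point here is easy to reproduce. The sketch below averages sale prices within each quarter and rescales so that the first quarter equals 1; the function name `mean_price_index` and the input layout are assumptions for illustration.

```python
from collections import defaultdict

def mean_price_index(sales):
    """Mean price index from an iterable of (quarter, price) records.

    Quarters must be sortable labels (for example integers); the first
    quarter in sorted order is the base period with index value 1.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for quarter, price in sales:
        totals[quarter] += price
        counts[quarter] += 1
    quarters = sorted(totals)
    means = [totals[q] / counts[q] for q in quarters]
    return {q: m / means[0] for q, m in zip(quarters, means)}
```

Every sale enters this index exactly once and only through its own quarter, which is why no information is shared across quarters.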

6. The case of Los Angeles, CA. Even though the autoregressive model 
has a lower RMSE than the S&P/Case-Shiller model for Los Angeles, CA, 
it does not seem to fit the data well. If we examine Figure 7, a plot of the 
correlation against gap time, we immediately see two significant issues when 
what is expected (line) is compared with what the data indicate (dots). 
First, the value of $\phi$ is not as close to 1 as expected. Second, the pattern
of decay also does not follow the presumed pattern. We will focus
on Los Angeles, CA, and discuss these two issues for the remainder of this
section.

Fig. 7. Problems with the assumptions. [Los Angeles, CA: correlation against gap time (quarters).]

We expect $\phi$ to be close to 1; however, for Los Angeles, CA, this does not
seem to be the case. In fact, according to the data, for short gap times, the 
correlation between sale pairs seems to be far lower than one. To investigate 
this feature, we examine sale pairs with gap times between 1 and 5 quarters 
more closely. In Figure 8, we construct a histogram of the quarters where 
the second sale occurred for this subset of sale pairs. We pair this histogram 
with a plot of the price index for Los Angeles, CA. Most of these sales 
occurred during the late 1980s and early 1990s. This corresponds to the 
same period when Sing and Furlong (1989) found that lenders were offering 
people mortgages where the monthly payment was greater than 33% of their 
monthly income. The threshold of 33% is set to help ensure that people will 
be able to afford their mortgage. Those persons with mortgages that exceed 
this percentage tend to have a higher probability of defaulting on their 
payments. 

Bates (1989) found that a number of banks including the Bank of Cal- 
ifornia and Wells Fargo were highly exposed to these risky investments, 
especially in the wake of the housing downturn during the early 1990s. If 
a short gap time is an indication that a foreclosure took place, this would 
explain why these sale pair prices are not highly correlated. We did observe, 
however, that other cities also experienced periods of decline, such as Stam- 
ford, CT (see Figure 2), but did not have anomalous autoregressive patterns 
like those in Figure 7 for Los Angeles, CA. 

Even if this were not the case, the autoregressive model may not be per- 
forming well simply because there was a downturn in the housing market. 
Most of the cities in our data cover periods where the indices are increasing;
the model may be performing well only because of this feature. In the case
of Los Angeles, CA, if we examine the period between January 1990 and
December 1996 in Figure 8, the housing index was decreasing. However, if
we calculate the RMSE of test set sales for this period only, we find that
the autoregressive model still performs better than the S&P/Case-Shiller
method. The RMSE values are $32,039 and $41,841, respectively. Therefore,
the autoregressive model seems to perform better in a period of decline
as well as in times of increase.

Fig. 8. Examining the housing downturn.

The second irregularity evident in Figure 7 is that the AR(1) process 
does not decay at the same rate as the model predicts. In 1978 California 
voters, as a protest against rising property taxes, passed Proposition 13 
which limited how fast property tax assessments could increase per year. 
Galles and Sexton (1998) argue that Proposition 13 encouraged people to 
retain homes especially if they have owned their home for a long time. It is 
possible that this feature of Figure 7 is a long term effect of Proposition 13. 
On the other hand, it could be that California home owners tend to renovate 
their homes more frequently than others, reducing the decay in prices over 
time. However, we have no way of verifying either of these explanations given 
our data. 

7. Discussion. Two key tasks when analyzing house prices are predicting 
sale prices of individual homes and constructing price indices which mea- 
sure general housing trends. Using extensive data from twenty metropolitan 
areas, we have compared our predictive method to two other methods,
including the S&P/Case-Shiller Home Price Index. We find that on average
the predictions using our method are more accurate in all but one of the
twenty metropolitan areas examined.

Data such as ours often contain little reliable hedonic information on
individual homes, if any. Therefore, harnessing the information contained
in a previous sale is critical. Repeat sales indices attempt to do exactly that. 
Some methods have also incorporated ad hoc adjustments to take account 
of the gap time between the repeat sales of a home. In contrast, our model 
involves an underlying AR(1) time series which automatically adjusts for the 
time gap between sales. It also uses the home's ZIP code as an additional 
indicator of its hedonic value. This indicator has some predictive value, 
although its value is quite weak by comparison with the price in a previous 
sale if one has been recorded. 

The index constructed from our statistical model can be viewed as a 
weighted average of estimates from single and repeat sales homes, with the 
repeat sales prices having a substantially higher weight. As noted, the time 
series feature of the model guarantees that this weight for repeat sales prices 
slowly decreases in a natural fashion as the gap time between sales increases. 

Our results do not provide definitive evidence as to the value of our index 
when comparing with other currently available indices as a general economic 
indicator. Indeed, such a determination should involve a study of the eco- 
nomic uses of such indicators as well as an examination of their formulaic 
construction and their use for prediction of individual sale prices. We have 
not undertaken such a study, and so can offer only a few comments about 
the possible comparative values of our index. 

As we have discussed, we feel it may be an advantage that our index in- 
volves all home sales in the data (subject to the naturally occurring weight- 
ing described above), rather than only repeat sales. Repeat sales homes are 
only a small, selected fraction of all home sales. Studies have shown that re- 
peat sales homes may have different characteristics than single sale homes. 
In particular, they are evidently older on average, and this could be expected 
to have an effect on their sale price. Since our measure brings all home sales 
into consideration, albeit in a gently weighted manner, and since it provides 
improved prediction on average, it may produce a preferable index. 

Another advantage of our model is that it remains easy to interpret at 
both the micro and macro levels, in spite of including several features in- 
herent in the data. Future work seems desirable to understand anomalous 
features such as those we have discussed in the Los Angeles, CA, area. Such 
research may allow us to construct a more flexible model to accommodate 
such cases. For example, it could involve the inclusion of economic indica- 
tors which may affect house prices such as interest rates and tax rates and 
measures of general economic status such as the unemployment rate. 

APPENDIX A: DATA SUMMARY

Table 6
Summary counts

                                                 No. houses per sale count
City               No. sales  No. houses         1         2        3      4+
Ann Arbor, MI         68,684      48,522    32,458    12,662    2,781     621
Atlanta, GA          376,082     260,703   166,646    76,046   15,163   2,836
Chicago, IL          688,468     483,581   319,340   130,234   28,369   5,603
Columbia, SC           7,034       4,321     2,303     1,470      431     117
Columbus, OH         162,716     109,388    67,926    31,739    7,892   1,831
Kansas City, MO      123,441      90,504    62,489    23,706    3,773     534
Lexington, KY         38,534      26,630    16,891     7,901    1,555     282
Los Angeles, CA      543,071     395,061   272,258   100,918   18,965   2,903
Madison, WI           50,589      35,635    23,685     9,439    2,086     425
Memphis, TN           55,370      37,352    23,033    11,319    2,412     587
Minneapolis, MN      330,162     240,270   166,811    59,468   11,856   2,127
Orlando, FL          104,853      72,976    45,966    22,759    3,706     543
Philadelphia, PA     402,935     280,272   179,107    82,681   15,878   2,606
Phoenix, AZ          180,745     129,993    87,249    35,910    5,855     968
Pittsburgh, PA       104,544      73,871    48,618    20,768    3,749     718
Raleigh, NC          100,180      68,306    42,545    20,632    4,306     818
San Francisco, CA     73,598      59,416    46,959    10,895    1,413     149
Seattle, WA          253,227     182,770   124,672    47,406    9,198   1,494
Sioux Falls, SD       12,439       8,974     6,117     2,353      419      85
Stamford, CT          14,602      11,128     8,200     2,502      357      62


Table 7
Number of ZIP codes by city

City               No. ZIP codes
Ann Arbor, MI                 57
Atlanta, GA                  184
Chicago, IL                  317
Columbia, SC                  12
Columbus, OH                 103
Kansas City, MO              179
Lexington, KY                 31
Los Angeles, CA              280
Madison, WI                   40
Memphis, TN                   64
Minneapolis, MN              214
Orlando, FL                   96
Philadelphia, PA             330
Phoenix, AZ                  130
Pittsburgh, PA               257
Raleigh, NC                   82
San Francisco, CA             70
Seattle, WA                  110
Sioux Falls, SD               30
Stamford, CT                  23



Table 8
Training and test set sizes

                          Autoregressive model         S&P/Case-Shiller model
City                Training     Test  No. houses  Training pairs  No. houses
Ann Arbor, MI         58,953    9,731      48,522          10,431       9,735
Atlanta, GA          319,925   56,127     260,703          59,222      55,911
Chicago, IL          589,289   99,179     483,581         105,708      99,069
Columbia, SC           5,747    1,287       4,321           1,426       1,279
Columbus, OH         136,989   25,727     109,388          27,601      25,458
Kansas City, MO      107,209   16,232      90,504          16,705      16,092
Lexington, KY         32,705    5,829      26,630           6,075       5,748
Los Angeles, CA      470,721   72,350     395,061          75,660      72,338
Madison, WI           43,349    7,240      35,635           7,714       7,221
Memphis, TN           46,724    8,646      37,352           9,372       8,673
Minneapolis, MN      286,476   43,686     240,270          46,206      43,764
Orlando, FL           89,123   15,730      72,976          16,147      15,531
Philadelphia, PA     343,354   59,581     280,272          63,082      60,068
Phoenix, AZ          155,823   24,922     129,993          25,830      24,656
Pittsburgh, PA        89,762   14,782      73,871          15,891      14,956
Raleigh, NC           84,678   15,502      68,306          16,372      15,388
San Francisco, CA     66,527    7,071      59,416           7,111       6,948
Seattle, WA          218,741   34,486     182,770          35,971      34,304
Sioux Falls, SD       10,755    1,684       8,974           1,781       1,677
Stamford, CT          12,902    1,700      11,128           1,774       1,654


APPENDIX B: UPDATING EQUATIONS 

In this section we provide the updating equations for estimating the
parameters $\theta = \{\beta, \sigma_\varepsilon^2, \sigma_\tau^2, \phi\}$ in the autoregressive model (see Section 3).
Observe that the covariance matrix $V$ is an $N \times N$ matrix where $N$ is the
sample size. Given the size of our data, it is simpler computationally to
exploit the block diagonal structure of $V$. Each block, denoted by $V_z$,
corresponds to observations in ZIP code $z$. Computations are carried out
on the ZIP code level and the updating equations provided below reflect
this. For instance, $y_z$ and $T_z$ are the elements of the log price vector and
transformation matrix respectively for observations in ZIP code $z$.

To start, an explicit expression for $\beta$ can be formulated:

(19)    $\hat{\beta} = \left( \sum_{z=1}^{Z} (T_z X_z)' V_z^{-1} T_z X_z \right)^{-1} \sum_{z=1}^{Z} (T_z X_z)' V_z^{-1} T_z y_z.$
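
The block diagonal structure means the estimate of $\beta$ can be accumulated one ZIP code at a time without ever forming the full covariance matrix. The following NumPy sketch illustrates that idea as a generalized least squares sum over blocks. It is a minimal illustration under assumptions: each block is supplied as the already transformed design matrix, the transformed response and the block covariance, and the function name `gls_beta` is hypothetical.

```python
import numpy as np

def gls_beta(blocks):
    """Generalized least squares estimate accumulated over ZIP code blocks.

    blocks is an iterable of (TX_z, Ty_z, V_z): the transformed design matrix,
    the transformed response vector and the covariance block for ZIP code z.
    """
    lhs, rhs = 0.0, 0.0
    for TX, Ty, V in blocks:
        Vinv_TX = np.linalg.solve(V, TX)   # V_z^{-1} (T_z X_z), no explicit inverse
        lhs = lhs + TX.T @ Vinv_TX         # sum of (T_z X_z)' V_z^{-1} (T_z X_z)
        rhs = rhs + Vinv_TX.T @ Ty         # sum of (T_z X_z)' V_z^{-1} (T_z y_z), V_z symmetric
    return np.linalg.solve(lhs, rhs)
```

With identity covariance blocks this reduces to ordinary least squares, which provides a quick sanity check.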

Estimates must be computed numerically for the remaining parameters. As
all of these are one-dimensional parameters, methods such as the
Newton-Raphson algorithm are highly suitable. We first define $w_z = y_z - X_z\beta$ for
clarity. To update $\sigma_\varepsilon^2$, compute the zero of

(20)    $0 = -\sum_{z=1}^{Z} \mathrm{tr}\left( V_z^{-1}\,\mathrm{diag}(r_z) \right) + \sum_{z=1}^{Z} (T_z w_z)' V_z^{-1}\,\mathrm{diag}(r_z)\,V_z^{-1} (T_z w_z),$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix and $\mathrm{diag}(r)$ is as defined in (4). Similarly,
to update $\sigma_\tau^2$, find the zero of

(21)    $0 = -\sum_{z=1}^{Z} \mathrm{tr}\left( V_z^{-1} (T_z 1_{n_z})(T_z 1_{n_z})' \right) + \sum_{z=1}^{Z} (T_z w_z)' V_z^{-1} (T_z 1_{n_z})(T_z 1_{n_z})' V_z^{-1} (T_z w_z),$

where $n_z$ denotes the number of observations in ZIP code $z$ and $1_k$ is a
$(k \times 1)$ vector of ones.
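
Since each of these updates is a one-dimensional root-finding problem, a generic Newton-Raphson routine is enough to illustrate the idea. The sketch below uses a central-difference approximation to the derivative so that only the function itself must be supplied; this is a simplification of ours, and the name `newton_raphson`, the step size and the tolerances are assumptions. The toy function in the usage lines is not one of the score equations above; it has the same general shape as a variance score, with its zero at 10/5 = 2.

```python
def newton_raphson(g, x0, tol=1e-10, max_iter=100, h=1e-6):
    """Find a zero of the scalar function g, starting from x0."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            return x
        slope = (g(x + h) - g(x - h)) / (2 * h)   # central-difference derivative
        x -= gx / slope
    raise RuntimeError("Newton-Raphson did not converge")


# Toy example: zero of g(v) = -5/v + 10/v**2, started to the left of the root.
toy_root = newton_raphson(lambda v: -5.0 / v + 10.0 / v ** 2, x0=1.0)
```

In a coordinate ascent scheme, a routine of this kind would be called once per parameter in each sweep, holding the other parameters at their current values.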

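Finally, to update the autoregressive parameter $\phi$, we must calculate the
zero of the function below:

(22)    $\begin{aligned} 0 = {} & -\sum_{z=1}^{Z} \mathrm{tr}\left\{ V_z^{-1} \left[ \sigma_\tau^2 \left( \frac{\partial (T_z 1_{n_z})}{\partial \phi} (T_z 1_{n_z})' + (T_z 1_{n_z}) \left( \frac{\partial (T_z 1_{n_z})}{\partial \phi} \right)' \right) + \frac{2\phi\sigma_\varepsilon^2}{(1-\phi^2)^2}\,\mathrm{diag}(r_z) + \frac{\sigma_\varepsilon^2}{1-\phi^2} \frac{\partial\,\mathrm{diag}(r_z)}{\partial \phi} \right] \right\} \\ & - 2\sum_{z=1}^{Z} \left( \frac{\partial T_z}{\partial \phi} w_z \right)' V_z^{-1} (T_z w_z) \\ & + \sum_{z=1}^{Z} (T_z w_z)' V_z^{-1} \left[ \sigma_\tau^2 \left( \frac{\partial (T_z 1_{n_z})}{\partial \phi} (T_z 1_{n_z})' + (T_z 1_{n_z}) \left( \frac{\partial (T_z 1_{n_z})}{\partial \phi} \right)' \right) + \frac{2\phi\sigma_\varepsilon^2}{(1-\phi^2)^2}\,\mathrm{diag}(r_z) + \frac{\sigma_\varepsilon^2}{1-\phi^2} \frac{\partial\,\mathrm{diag}(r_z)}{\partial \phi} \right] V_z^{-1} (T_z w_z). \end{aligned}$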

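After the estimates converge, we must estimate the random effects. We
use Henderson's procedure to derive the Best Linear Unbiased Predictors
(BLUP) for each ZIP code. His method assumes that the parameters in the
covariance matrix, $V$, are known; however, we use the estimated values. The
formula is

(23)    $\hat{\tau}_z = \left( \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_\tau^2} + (1-\hat{\phi}^2)(\hat{T}_z 1_{n_z})'\,\mathrm{diag}^{-1}(\hat{r}_z)(\hat{T}_z 1_{n_z}) \right)^{-1} \times \left( (1-\hat{\phi}^2)(\hat{T}_z 1_{n_z})'\,\mathrm{diag}^{-1}(\hat{r}_z)(\hat{T}_z \hat{w}_z) \right),$

where $\mathrm{diag}^{-1}(\hat{r}_z)$ is the inverse of the estimated diagonal matrix $\mathrm{diag}(\hat{r}_z)$.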

SUPPLEMENTARY MATERIAL

Supplement to "An autoregressive approach to house price modeling" 

(DOI: 10.1214/10-AOAS380SUPP; .pdf). This supplement contains extra 
analysis on a variety of topics related to the paper from examining the con- 
vergence of the coordinate ascent algorithm, or applying alternate loss func- 
tions, to studying the impact of each feature included in the autoregressive 
(AR) model. 